Abstract: With the rapid increase of transnational communication
and cooperation, people frequently encounter multilingual
scenarios in various situations. In this paper, we are concerned 
with a relatively new problem: script identification at word or 
line levels in natural scenes. A large-scale dataset with a great 
quantity of natural images and 10 types of widely-used languages 
is constructed and released. To address the challenges of
script identification in real-world scenarios, a deep learning based
algorithm is proposed. The experiments on the proposed dataset
demonstrate that our algorithm achieves superior performance, 
compared with conventional image classification methods, such 
as the original CNN architecture and LLC. 1 

I. Introduction
Automatic script identification is a task that facilitates many
important applications in both document analysis and natural
scene text recognition. Previous work mostly focuses on script
identification in documents [1]-[3] or videos [4], [5]. In
documents, script identification can be done at the page/paragraph
level [6], text line level [7], word level [8] or character
level [9]. Tan [1] investigated the properties of a group of
rotation-invariant texture features and used these features to
recognize the language type of characters in machine-printed
documents. Hochberg et al. [2] proposed a script identification
system for characters stored electronically in image form.

In this paper, we are concerned with a relatively new
problem: script identification at word or line levels in natural
scenes. Identifying scripts in natural scenes is an important
task, particularly for text reading systems under multilingual
scenarios [10]. Naturally, this problem can be cast as an
image classification problem, which has been studied extensively
[11], [12]. Nevertheless, it remains a challenging
problem, mainly due to four reasons: (1) Characters in the
wild usually exhibit higher variability in font, color and
layout. (2) Backgrounds in natural scenes are more complex
and may contain clutter or noise. (3) Different scripts may
share a subset of their alphabets, as illustrated in Figure 1. This
phenomenon makes it difficult to distinguish among different
types of languages solely from appearance. (4) Text images
have arbitrary lengths, ruling out classification methods
that only operate on fixed-size inputs. To deal with these
challenges, we propose a unified deep learning based framework
for recognizing scripts in the wild. To this end, we make
the following contributions:


Fig. 1. Illustration of scripts that share subsets of alphabets. Characters “A”, 
“B” and “E” appear in all these three scripts. The script identification relies 
on special characters that are unique to particular scripts. 

• We establish a large-scale benchmark for algorithm development
and comparison. The benchmark includes 13,045
word images, cropped from 7,700 full images taken in
diverse real-world scenarios. The scripts in the images
are from 10 different languages.

• To tackle the challenges described above, we propose a
deep learning based algorithm, which could serve as the
baseline algorithm in future research.

• Compared with other conventional image classification
methods, our approach better exploits the characteristics
of texts in natural images and obtains superior performance
when evaluated on the proposed benchmark.

II. The SIW-10 Dataset 

There exist several public datasets that consist of natural 
images with texts, for instance, ICDAR 2011 [13], SVT [14] 
and HIT 5K-Word [15]. However, these datasets are primarily 
used for scene text detection and recognition tasks. Besides, 
they are dominated by English or other Latin-based languages. 
In the area of script identification, there exist datasets [2], [4],
[5], but these databases only include characters in document
images or videos, rather than natural images.

Therefore, we propose a new dataset for script identification
in wild scenes 2 . The dataset contains multi-script images
that are cropped from natural scene images (Figure 2). As
illustrated in Figure 3, the database includes text images
from 10 languages: Arabic, Chinese, English, Greek, Hebrew,
Japanese, Korean, Russian, Thai and Tibetan. Hence, we call
this benchmark the Script Identification in the Wild 10 Classes
(SIW-10) dataset.

1 This paper has been submitted to International Conference on Document
Analysis and Recognition (ICDAR) 2015 and is now under peer-review.

2 We will release the dataset (cropped images and full sized images) for
academic use.

Fig. 2. Examples of images that we harvested from Google Street View,
along with the annotations.

Fig. 3. Some examples in the SIW-10 dataset. The dataset contains altogether 13,045 cropped images of words or text lines in 10 classes.

We first harvest a collection of street view images from
Google Street View and manually label the text regions
with bounding boxes (Figure 2). Text line images are
then cropped from these images. For each language category,
600 to 1,000 street view images are collected and 1,000 to
2,000 text regions are extracted. In our dataset, we include
only horizontal text images that contain one or several words.
The dataset contains 13,045 cropped images of words or text
lines. Among them, 5,000 are used for testing and the remaining
8,045 are used for training.

Note that the SIW-10 dataset is diverse and challenging. The 
images are captured from different locations all over the world, 
under different imaging conditions. The majority of the images
suffer from low resolution, noise or blur. We believe the SIW-10
dataset can serve as a standard benchmark for script
identification in the wild.

III. Our Approach 

Considering the challenges discussed in Section I, it would
be desirable if our approach is able to capture distinctive
characters or components in the text images, and able to deal
with inputs with arbitrary aspect ratios.

Recent research on image classification has seen a leap 
forward, thanks to the wide applications of deep convolutional 
neural networks (CNNs, [16]). CNNs are designed to operate 
on input maps with fixed widths and heights. Text images, 
however, come in arbitrary sizes and aspect ratios. Their aspect 
ratios vary greatly, depending on the number of characters they 
contain. 

To utilize the power of automatic feature learning in CNNs
while adapting it to the script identification problem, we
propose in the following the Multi-stage Spatially-sensitive Pooling
Network (MSPN), a novel variant of CNN, for the script
identification task. The network efficiently captures rich and
distinctive features in text images for script identification,
while inherently and naturally dealing with input images of
arbitrary aspect ratios.

A. Architecture 

The architecture of the network is depicted in Figure 4. As
a preprocessing step, the input images are resized to a
fixed height (32 pixels throughout our experiments), keeping
their aspect ratios. The first four convolutional layers (with
max-pooling and rectifier layers) in the network work the
same way as they do in a CNN, except that the sizes of the
input maps are arbitrary. Since convolutional layers apply filter
banks to all locations in an image, these layers are inherently
capable of dealing with images of arbitrary sizes. The sizes
of their output maps change with the input sizes. In our
network, images are fixed in height. Therefore the response
maps produced by these convolutional layers are fixed in
height but vary in width. In our settings, output maps have
heights 15, 7, 3, 1 respectively and widths proportional to the
width of the input image. These layers aim to capture rich,
hierarchical features from raw image pixels.
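
The preprocessing step can be sketched as follows. This is only an illustration of a fixed-height, aspect-preserving resize; the function name `resized_shape` and the rounding rule are assumptions, not details given in the text.

```python
def resized_shape(width, height, target_height=32):
    """Return (new_width, new_height) for a fixed-height resize.

    The height is forced to target_height (32 pixels in the text) and the
    width is scaled by the same factor, so the aspect ratio is kept.
    Rounding to the nearest integer is an assumption of this sketch.
    """
    scale = target_height / height
    new_width = max(1, round(width * scale))
    return new_width, target_height


# Images of different sizes end up with the same height but different widths.
for w, h in [(64, 64), (200, 50), (480, 40)]:
    print((w, h), "->", resized_shape(w, h))
```

The variable output width is what the later pooling layers have to absorb.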

In a conventional CNN structure, response maps output
by the last convolutional layer are flattened and fed to the
following fully-connected layers or locally-connected layers.
These kinds of layers can only accept inputs with a fixed number
of dimensions. They are not suitable for our problem because
text images vary in length. Besides, the discriminative
features in a text image can appear at any horizontal position,
since characters are arranged in different orders, but their
vertical positions still matter. To address these problems, we
propose the spatially-sensitive pooling layer, which accepts
input maps with arbitrary widths, and captures topological
information along the vertical direction.

Fig. 4. MSPN architecture illustrated following the style used in [12]. The four cuboids after convolutional stages represent response maps. The size for
each of the cuboids corresponds to map width × map height × number of maps. The cuboids inside represent convolutional kernels. The spatially-sensitive
pooling layers are indicated by arrows in three different colors. The output is the softmax probability vector of length 10. See Section III-A for details.

B. Spatially-sensitive pooling layer 

As discussed in Section I, distinctive parts play
an important role in script identification. Typically, text is
a collection of characters arranged in a line. A distinctive
character or character component may appear at any horizontal
position in the image, so its horizontal position is
less informative. Its vertical position still matters, however, as
characters are written upright. On the other hand, text images
have arbitrary aspect ratios, and conventional CNNs cannot
deal with them directly. We propose the Spatially-Sensitive
Pooling (SSP) layer, which captures useful features for script
identification and also handles arbitrary input sizes.

The SSP layer takes input maps that have a fixed number of
rows but a variable number of columns. For each of the input
maps, the SSP layer pools along each row of the map by taking
the maximum or average value in each row. Assume that the
input is a tensor of size $n_{map} \times w \times h$, where $n_{map}$ is the
number of input maps and $w$, $h$ are the width and height of the
maps. Then the output is a vector of length $n_{map} \times h$,
which is independent of the width of the input images.
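
The row-wise pooling described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the experiments; the function name `ssp` and the nested-list layout (maps, then rows, then columns) are assumptions made for readability.

```python
def ssp(feature_maps, mode="max"):
    """Spatially-sensitive pooling over the columns of every row.

    feature_maps: list of n_map maps; each map is a list of h rows;
    each row is a list of w responses (w may differ between images).
    Returns a flat list of length n_map * h, whatever w is.
    """
    if mode == "max":
        pool = max
    else:  # average pooling
        pool = lambda row: sum(row) / len(row)
    return [pool(row) for fmap in feature_maps for row in fmap]


# Two inputs with n_map = 2 and h = 2 but different widths (2 and 4).
narrow = [[[1, 5], [2, 0]], [[3, 3], [4, 9]]]
wide = [[[1, 5, 2, 7], [2, 0, 1, 1]], [[3, 3, 8, 0], [4, 9, 2, 2]]]
print(len(ssp(narrow)), len(ssp(wide)))  # both outputs have length 4
```

Each output entry is tied to one (map, row) pair, which is how vertical position is kept while horizontal position is discarded.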

The SSP layer makes the network invariant to the horizontal
positions of responses, while preserving their vertical
positions. This makes it suitable for describing
images of text arranged in a line. Furthermore, SSP layers accept
input maps with arbitrary widths and output vectors with
fixed lengths, and can thus act as the bridge between
the convolutional layers and the fully-connected layers. As a
consequence, the network is able to deal with images of
arbitrary sizes and aspect ratios inherently and naturally.

C. Multi-stage pooling 

The distinctive features could be at different abstraction
levels. Both higher level features and lower level features may
help the recognition process. To utilize features at different
abstraction levels, we introduce a multi-stage pooling scheme
into the network. As illustrated in Figure 4, the colored lines
that start from the conv2, conv3 and conv4 layers
indicate SSP layers that are inserted after these convolutional
layers. The outputs of the three pooling layers are concatenated
into a long vector, which is fed to the later fully-connected layers.
The long vector thus contains
pooling features from all three SSP layers. Since layers
conv2, conv3 and conv4 output response maps at different
abstraction levels, the concatenated features
describe the text image at both high and low
levels. The resulting features are rich and describe multiple
aspects of the text image, which is desirable for script
identification.
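
The concatenation step can be illustrated as below. The map heights 7, 3 and 1 for conv2, conv3 and conv4 follow Section III-A; the numbers of maps (4, 6, 8), the dummy responses and the helper names are hypothetical and chosen only to keep the example small.

```python
def ssp_max(feature_maps):
    """Max-pool every row of every map over its columns."""
    return [max(row) for fmap in feature_maps for row in fmap]


def multi_stage_descriptor(stage_outputs):
    """Concatenate the SSP features of several convolutional stages."""
    descriptor = []
    for maps in stage_outputs:
        descriptor.extend(ssp_max(maps))
    return descriptor


def dummy_maps(n_map, h, w):
    """Hypothetical response maps filled with arbitrary values."""
    return [[[float(m + r + c) for c in range(w)] for r in range(h)]
            for m in range(n_map)]


# Two "images" of different widths give descriptors of the same length:
# 4*7 + 6*3 + 8*1 entries with the made-up map counts used here.
for w in (10, 25):
    stages = [dummy_maps(4, 7, w),
              dummy_maps(6, 3, max(1, w // 2)),
              dummy_maps(8, 1, max(1, w // 4))]
    print(w, len(multi_stage_descriptor(stages)))
```

Because every stage contributes a fixed number of entries, the fully-connected layers always see an input of the same size.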

The multi-stage pooling scheme results in a graph-structured
network that is more complex than a simple sequentially
structured network. In this network, convolutional layers take
errors back-propagated from multiple layers. For example,
conv2 receives errors back-propagated from both conv3 and
fc1. These convolutional layers (conv2 and conv3 in our
network) are trained with respect to gradients on pooling
features as well as response maps. This potentially encourages
these layers to produce more discriminative response maps,
because their outputs are directly used for classification after
pooling. Therefore, the network benefits from the rich and
discriminative features.

IV. Experiments 

We evaluate our MSPN on the collected SIW-10 dataset. 
Besides, we implement a baseline algorithm using the conventional
CNN, with some simple workarounds to bypass
the variable aspect ratio problems. As another baseline, we
implement the Locality-constrained Linear Coding (LLC) [11] 
algorithm, which is widely used in image classification tasks. 
We compare the performances of these approaches with the 
proposed approach on the SIW-10 dataset. 

A. Dataset and Implementation Details 

1) Dataset: We evaluate our algorithms on the SIW-10
dataset. The testing set contains 5,000 images in total,
with 500 testing images for each class. The remaining 8,045
images are used for training. The numbers of training images for
the script classes are, respectively: Arabic 503, Chinese 809,
English 725, Greek 522, Hebrew 770, Japanese 717, Korean
1064, Russian 532, Thai 1726 and Tibetan 677.
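
The split sizes can be checked with a few lines of arithmetic. The snippet below only restates the per-class counts listed above; the variable names are illustrative.

```python
# Per-class training counts as listed in the text.
train_counts = {
    "Arabic": 503, "Chinese": 809, "English": 725, "Greek": 522,
    "Hebrew": 770, "Japanese": 717, "Korean": 1064, "Russian": 532,
    "Thai": 1726, "Tibetan": 677,
}
test_per_class = 500

n_train = sum(train_counts.values())
n_test = test_per_class * len(train_counts)
print(n_train, n_test, n_train + n_test)  # 8045 5000 13045

# The training set is unbalanced, while the test set is balanced.
largest = max(train_counts, key=train_counts.get)
smallest = min(train_counts, key=train_counts.get)
print(largest, smallest)
```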

2) Implementation details: We train the network using
stochastic gradient descent (SGD) [17] with the initial learning
rate set to 0.01 and the momentum set to 0.9. Following [12],
the learning rate is decreased by a factor of 0.1 when the
validation error plateaus. The training terminates once the
learning rate is less than $1 \times 10^{-5}$.
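
The learning rate schedule can be sketched as follows. The initial rate, decay factor and stopping threshold come from the text; the plateau test (no improvement for `patience` consecutive epochs) and the function name `lr_schedule` are assumptions, since the text does not say how a plateau is detected. `Decimal` is used only so that the comparison with the threshold is exact.

```python
from decimal import Decimal


def lr_schedule(val_errors, lr0=Decimal("0.01"), factor=Decimal("0.1"),
                min_lr=Decimal("0.00001"), patience=2):
    """Return the learning rate used at each epoch until training stops.

    val_errors: validation error observed after each epoch.
    The rate is multiplied by `factor` when the error has not improved
    for `patience` epochs (assumed plateau rule), and training stops
    once the rate is below `min_lr`.
    """
    lr, best, stale, history = lr0, float("inf"), 0, []
    for err in val_errors:
        if lr < min_lr:          # termination rule from the text
            break
        history.append(lr)
        if err < best:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:
                lr, stale = lr * factor, 0
    return history


# A validation curve that improves once and then stays flat.
history = lr_schedule([0.5, 0.4] + [0.4] * 20)
print(len(history), history[0], history[-1])
```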

B. Baseline Methods 



1) CNN-Simple: We set up a convolutional neural network
to identify the scripts from text images. The CNN structure
we use is similar to our MSPN in the convolutional parts.
A CNN cannot deal with inputs of arbitrary sizes. To address
this problem, we first sample patches from the input images and
resize them to a fixed size, as illustrated in Figure 4. Each patch
is assigned the label of the image it is cropped
from. In our implementation, text images are first resized to
a height of 40, and square patches with side lengths randomly
chosen within the range [25, 40] are sampled. All the patches are then resized
to 32 × 32 after sampling. The CNN is trained on the patches
sampled from training images, with the same scheme described
in Section IV-A2.
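
The patch sampling can be sketched as below. The patch side range [25, 40] and the 32 × 32 output size follow the text; the uniform choice of patch position, the nearest-neighbour resize, the clamping of the side length for very narrow images and all function names are assumptions of this sketch.

```python
import random


def sample_patch(image, rng, out_size=32, min_side=25, max_side=40):
    """Crop a random square patch and resize it to out_size x out_size.

    image: list of pixel rows (height 40 in the text, any width).
    The resize is a plain nearest-neighbour lookup, used here only to
    keep the example dependency-free.
    """
    h, w = len(image), len(image[0])
    side = min(rng.randint(min_side, max_side), h, w)  # clamp to the image
    top = rng.randint(0, h - side)
    left = rng.randint(0, w - side)
    idx = [i * side // out_size for i in range(out_size)]
    return [[image[top + r][left + c] for c in idx] for r in idx]


def sample_training_patches(image, label, n_patches, seed=0):
    """Each patch inherits the label of the image it is cropped from."""
    rng = random.Random(seed)
    return [(sample_patch(image, rng), label) for _ in range(n_patches)]


# A synthetic 40 x 120 "text image" with arbitrary pixel values.
image = [[(r * c) % 256 for c in range(120)] for r in range(40)]
patches = sample_training_patches(image, "Thai", n_patches=5)
print(len(patches), len(patches[0][0]), len(patches[0][0][0]))
```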

During the testing process, we first predict the labels of
patches using the trained CNN. Their classification probability
vectors $p$ output by the CNN soft-max layer are then used to
predict the class of the whole text image. Denote the set of
patches cropped from text image $I^{(i)}$ by $\{x_j^{(i)}\}_{j=1}^{n_i}$, where
$n_i$ is the number of patches sampled from image $I^{(i)}$. The
prediction is done by:

$$p_j^{(i)} = \mathrm{CNN}(x_j^{(i)}). \qquad (1)$$

In order to get the final prediction on the whole image,
we combine the predictions on all patches by calculating the
average of their probabilities:

$$y^{(i)} = \frac{1}{n_i} \sum_{j=1}^{n_i} p_j^{(i)}. \qquad (2)$$

The resulting $y^{(i)}$ is taken as the multi-class score vector
for text image $I^{(i)}$. The prediction is made by choosing the
class with the max score in $y^{(i)}$.
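
The image-level decision of Eq. (2) amounts to averaging the patch probability vectors and taking the arg max. A minimal sketch, with a hypothetical function name and made-up three-class probabilities:

```python
def predict_image(patch_probs):
    """Average per-patch probability vectors and pick the best class.

    patch_probs: list of n_i probability vectors, one per patch.
    Returns (score vector y, index of the predicted class).
    """
    n = len(patch_probs)
    n_classes = len(patch_probs[0])
    scores = [sum(p[k] for p in patch_probs) / n for k in range(n_classes)]
    label = max(range(n_classes), key=scores.__getitem__)
    return scores, label


# Three patches, three classes (illustrative numbers only).
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.6, 0.3],
         [0.6, 0.3, 0.1]]
scores, label = predict_image(probs)
print(scores, label)
```

Note that one confidently wrong patch can shift the average, which is consistent with the weakness of CNN-Simple discussed in the evaluation.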

2) LLC: LLC [11] is a widely adopted image
classification algorithm. It is based on the Bag-of-Words
model (BoW), max-pooling and spatial pyramid matching
(SPM) [18]. We densely sample SIFT descriptors at 3 different
scales. A subset of them is used to build a codebook with
2048 codewords. Training and testing images are coded using
the LLC coding scheme. The SPM is applied by vertically
dividing text images into 2 and 3 subregions of
equal height, respectively. The resulting coding features have
(1 + 2 + 3) × 2048 = 12288 dimensions. Finally, a linear
SVM [19] classifier is learned on the coding features.
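
The pooling that produces the (1 + 2 + 3) × 2048 dimensions can be sketched as below. The sketch assumes that the "1" term corresponds to pooling over the whole image and that each descriptor is assigned to a horizontal band by its vertical coordinate; the function name, the zero fill for empty bands and the toy 4-codeword codes are illustrative, not taken from the text.

```python
def spm_max_pool(codes, ys, height, levels=(1, 2, 3)):
    """Max-pool coding vectors inside equal-height bands.

    codes:  list of coding vectors (one per local descriptor).
    ys:     vertical position of each descriptor, in [0, height).
    levels: number of bands per pyramid level.
    Returns a flat vector with sum(levels) * len(codes[0]) entries.
    """
    k = len(codes[0])
    pooled = []
    for n_bands in levels:
        bands = [None] * n_bands
        for code, y in zip(codes, ys):
            b = min(int(y * n_bands / height), n_bands - 1)
            if bands[b] is None:
                bands[b] = list(code)
            else:
                bands[b] = [max(a, c) for a, c in zip(bands[b], code)]
        for band in bands:
            # Bands without any descriptor are filled with zeros (assumption).
            pooled.extend(band if band is not None else [0.0] * k)
    return pooled


# Three descriptors with 4-dimensional codes in an image of height 40.
codes = [[1, 0, 0, 2], [0, 3, 0, 0], [0, 0, 5, 0]]
ys = [2, 20, 38]
feature = spm_max_pool(codes, ys, height=40)
print(len(feature))        # (1 + 2 + 3) * 4 entries
print((1 + 2 + 3) * 2048)  # dimension with the 2048-word codebook
```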


Fig. 5. Comparisons on recognition accuracies among CNN-Simple, LLC 
and MSPN, evaluated on the SIW-10 dataset. 







Fig. 6. Prediction confusion matrix on SIW-10 made by MSPN. Ground truth
labels are on the Y-axis and predictions are on the X-axis. Misclassifications
frequently happen between Chinese and Japanese; Russian and Greek; Russian
and English, etc.


C. Evaluation 

We evaluate our MSPN as well as the baseline methods on
the SIW-10 dataset. The prediction accuracies for each class
are evaluated and compared in Figure 5. It can be seen that our
approach achieves the best results on all classes and reduces
the error of LLC by a large margin.

From the confusion matrix shown in Figure 6 we can
see that misclassifications frequently happen between several
pairs of scripts, e.g. Chinese and Japanese. The most likely
reason is that these languages share a large proportion
of their characters. Some words are even indistinguishable without
semantic information, which we have not yet incorporated into
our framework.

The CNN-Simple approach performs significantly worse
than the MSPN. The reasons could be that 1) in CNN-Simple,
the network is not directly optimized with respect to a loss
function that corresponds to the final classification accuracy,
and 2) CNN-Simple does not sufficiently exploit discriminative
features. It simply combines the results from all the sampled
patches, some of which might be uninformative or misleading.

Fig. 7. Some misclassified samples. The misclassifications are mainly due to
the shared characters between languages (e.g. Japanese vs. Chinese), unusual
layout and cluttered background.

LLC performs better than CNN-Simple. One reason
is that Bag-of-Words models inherently deal with inputs of
arbitrary sizes. MSPN outperforms LLC by a large margin.
Apart from the fact that deep models are stronger learners,
MSPN differs from LLC in that it learns task-specific filters
and implicitly extracts discriminative features. Furthermore,
compared to LLC, MSPN learns a much more compact image-level
descriptor (3456 vs 12288 dimensions), which further
shows the superiority of our approach.

Figure 7 shows some failure cases. From it one can see 
that our method might fail under cases of unusual layout, 
blurry text and ambiguity (sometimes Japanese words are 
written entirely in Chinese characters and there is no way to 
distinguish them other than using semantics). 

D. Discussion 

In this section we discuss the effect of multi-stage pooling. 
To verify the effectiveness of multi-stage pooling, we construct 
several network variants by removing part of the pooling 
layers. Table I shows the configurations of the variants and 
their corresponding recognition accuracies. We can see from 
the table that the performances of the pooling layers are close 
when they are used separately. Significant performance gain 
can be observed when they are combined. This demonstrates
the effectiveness of multi-stage pooling.

V. Conclusion 

We have presented an effective algorithm for script identification
in real-world scenarios. The proposed algorithm is able
to better exploit the properties of texts in natural images. Moreover,
we collected and released a large-scale benchmark for
performance evaluation and comparison. The experiments on
this dataset demonstrate that the proposed algorithm achieves
higher performance than conventional approaches, including
the original CNN method and LLC.

In this paper, we have only performed script identification
in cropped word images. In future work, we plan to investigate
approaches that can recognize the language type of texts from
full natural images. This direction would be more promising
and practical, because in reality the input to the script identification
system is more likely to be full images, instead of
cropped images.

TABLE I
Multi-stage pooling configurations for MSPN and its variants. "SSP-3" indicates that the network variant only uses
spatially-sensitive pooling layer-3 (SSP-3) in Figure 4, "SSP-2 + SSP-3" indicates the network variant that uses both SSP-2 and
SSP-3 (*In Variant-1 the number of hidden nodes in fc2 is set to 512)

Variant      Configurations           Average Error (%)
Variant-1    ssp-1                    7.3
Variant-2    ssp-2                    7.8
Variant-3    ssp-3*                   8.0
Variant-4    ssp-2 + ssp-3            7.4
Variant-5    ssp-1 + ssp-2            6.6
MSPN         ssp-1 + ssp-2 + ssp-3    5.6